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Abstract.  ^Ve  propose  a  few  implementations  of  the  Alternating  Direction  Method  for  solving 
parabolic  partial  differential  equations  on  multiprocessors.  A  careful  complexity  analysis  of  these 
implementations  shows  that,  contrary  to  what  is  generally  believed,  the  method  can  be  made 
highly  efficient  on  parallel  architectures  by  using  pipelining  and  variations  of  the  classical  Gaussian 
elimination  algorithm  for  solving  tridiagonal  systems. 
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1.  Introduction 

We  proDose  a  few  implementations  of  the  Alternating  Direction  Method  for  solving  parabolic 
partial  differential  equations  on  multiprocessors.  A  careful  complexity  analysis  of  these  implemen¬ 
tations  shows  that,  contrary  to  what  is  generally  believed,  the  method  can  be  made  highly  efficient 
on  parallel  architectures  by  using  pipelining  and  variations  of  the  classical  Gaussian  elimination 
algorithm  for  solving  tridiagonal  systems. 

2.  The  Alternating  Direction  Method  and  its  parallel  implementations 

We  consider  the  partial  differential  equation: 


on  the  domain  (x,y,t)  €  Cl  x  [0,T]  =  [0, 1]  x  [0,  l]  x  [0,T],  with  the  initial  and  boundary  conditions: 

u(x,2/,0)  =  u0{x,y),  V(r,j)en, 
u(x,g,l)  =  g(£,y,t),  V(i,  y)  €  3f2,  f>0, 

where  dfl  is  the  boundary  of  the  unit  square  Cl. 

A  common  approach  to  solve  the  above  problem  is  the  Alternating  Direction  Method.  First 
the  equations  are  discretized  with  respect  to  the  space  variables  x  and  y  using  a  mesh  of  n  +  1 
points  in  each  direction.  The  result  is  the  system  of  N  =  n2  ordinary  differential  equations: 

du 

=  Azu  + Bvu  (2.1) 

in  which  the  matrices  Ax  and  By  represent  the  3-point  central  difference  approximations  to  the 
operators  j;(a(x,y)&))  and  ■§i(b(x,y)-§i))  respectively. 

The  Alternating  Direction  Method  algorithm  consists  in  stepping  (2.1)  forward  in  time  alter¬ 
nately  in  the  x  and  y  directions  as  follows  [4): 


(/ -  ±AMjr)ui+*  =  (/+  ^Af£„)u' 

(2.2) 

(/-  \*tBy)ui+x  =  (I  +  ±AtAx)ui+$ . 

(2.3) 

We  observe  that  if  the  mesh  points  are  ordered  by  lines  in  the  x  direction,  then  (2.2)  constitutes 
a  set  of  n  independent  tridiagonal  systems  whose  solution  is  perfectly  parallellizable.  However. 
(2.3)  constitutes  one  tridiagonal  system  in  which  the  nonzero  diagonals  are  the  main  diagonal  and 
the  (n  +  1)**  super-  and  sub-diagoi  als.  It  is  important  to  note  that  this  second  system  can  also 
be  recast  into  a  set  of  n  independei  t  tridiagonal  systems  by  reordering  the  grid  points  by  lines, 
this  time  in  the  y  direction.  This  essentially  amounts  to  transposing  the  matrix  representing  the 
solution  at  the  n  x  n  grid  points  and  is  an  expensive  data  permutation  operation  which  is  often 
cited  as  the  main  drawback  of  Alternating  Direction  Method  in  regard  to  its  implementation  on 
parallel  machines.  The  other  difficulty  that  has  been  traditionally  associated  with  parallelizing  the 
Alternating  Direction  Method  is  that  the  classical  algorithms  for  solving  tridiagonal  systems  are 
sequential  in  nature.  Various  parallel  implementations  of  the  Alternate  Direction  Method  have 
been  discussed  by  Gannon  ai  d  Van  Rosendale  [ l)  Lambiotte  [2]  and  Ortega  and  Voigt  (3). 
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Since  it  is  the  tridiagonal  systems  that  are  usually  troublesome,  we  will  consider  the  costs  of 
only  the  two  tridiagonal  system  solutions  in  (2.2)  and  (2.3).  On  a  single  processor  each  half  step 
costs  5 n2  operations  for  the  forward  and  3n2  operations  for  the  backward  sweep.  Assuming  that 
s  arithmetic  operations  can  be  done  per  second,  the  time  for  a  half  step  on  a  single  processor  is 
approximately 

r,  =  (2.4) 


3.  Alternating  Direction  Methods  on  the  shared  memory  model  and  the  broadcast  bus  model 

In  this  section,  we  assume  that  k  processors  are  connected  to  one  large  shared  memory  with 
total  bandwidth  B  words  per  second  and  start  up  r.  Moreover,  we  assume  that  the  I/O  bandwidth 
of  each  processor  is  b.  Thus  B/b  different  processors  can  simultaneously  access  the  shared  memory. 

The  transfer  of  data  between  two  processors  is  done  by  the  first  processor  writing  to  the  shared 
memory  and  then  the  second  processor  reading  from  the  shared  memory.  Consider  a  method 
consisting  in  solving  n/k  similar  tridiagonal  systems  in  each  processor  for  (2.2)  then  transposing 
the  data  and  solving  (2.3)  as  n/k  tridiagonal  systems  in  each  processor  (it  is  assumed  that  k  <  n). 
The  total  time  required  for  performing  the  arithmetic  operations  of  one  half  step  is  approximately 

n  8n  8n2 

Ts,Arith  »  7 - =  — 

fc  s  ks 


Performing  a  transpose  of  the  n  x  n  matrix  of  the  unknown  u  requires  the  communication 


time: 


Ts.Camm  ^  2 


B/b 


K) 


b  n2 

Hence  in  the  shared  memory  model,  it  takes  a  total  time  of 


b  n2  8n2 
Ts^2k-r  +  2-  +  — 

to  perform  one  half  step  of  the  Alternating  Direction  Method. 

Assume  now  that  all  k  processors  are  linked  to  each  other  via  a  global  bus  which  allows  for 
broadcasting.  The  time  to  perform  the  arithmetic  operations  is  the  same  as  for  the  shared  memory 
model. 

To  transpose  the  matrix,  each  processor  broadcasts  in  turn  its  data  to  all  the  other  processors. 
This  requires  a  total  communication  time  of 

2  2 
__  ,  ,  n  \  .  n 

*B ,Comm  ^  “f"  )  —  KT  - f*  -j— 


Hence  the  total  time 


_  n2  8  n2 

Tb  ^  kr  +  —  +  - — 
b  ks 


for  performing  one  half  step  of  Alternating  Direction  Method. 
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square  into  4  processors. 

Observe  that  neither  of  the  above  two  models  achieves  a  reasonable  speed  up  when  k  increases 
because  of  the  limiting  term  n 2  in  the  communcation  cost.  This  is  the  source  of  the  belief  that 
Alternating  Direction  Methods  do  not  parallelize  well. 

4.  Alternating  Direction  Methods  on  a  multiprocessor  ring 

In  the  ring  architecture  the  k  processors  are  arranged  in  a  ring  and  each  processor  can  commu¬ 
nicate  with  its  two  nearest  neighbors.  It  is  assumed  throughout  that  the  processors  are  numbered 
so  that  processor  numbered  i  is  connected  to  processors  nubered  i+  1  and  i  —  1  modulo  k.  In  order 
to  efficiently  implement  the  algorithm,  we  will  first  assign  the  grid  points  such  as  to  minimize  data 
communication  costs.  The  assignment,  which  is  described  in  Figure  1  for  a  ring  of  k  —  4  processors, 
will  essentially  avoid  transposing  the  data  at  each  half  step. 

This  assignment  results  in  each  of  the  tridiagonal  systems  (2.2),  being  split  into  k  equal  parts, 
one  part  per  processor.  We  consider  the  cost  of  the  half  step  involving  the  sweep  in  the  x- direction . 
Looking  at  the  leftmost  column  of  subsquares  in  Figure  1,  each  of  the  k  processors  of  that  column 
can  start  performing  the  forward  sweeps  simultaneously  on  the  n/k  tridiagonal  systems  that  it 
holds.  Thus,  the  sweep  eliminates  the  variables  horizontally  from  left  to  right.  When  the  boundary 
of  the  first  subsquare  is  reached,  each  processor  sends  to  its  right  neighbor  in  the  figure  the  data  that 
is  necessary  for  that  processor  to  continue  its  forward  sweep  on  the  second  part  of  the  tridiagonal 
matrix,  i.e.,  the  processor  numbered  j  sends  to  its  neighbor  numbered  (j+ 1)  Mod  k  the  (j—  1)^  +  1 
to  jj  rows  of  the  tridiagonal  system.  The  forward  sweep  can  now  be  pursued  on  the  second  column 
of  subsquares,  and  this  process  is  repeated  until  the  forward  sweep  is  completed  on  the  variables 
of  the  last  column  of  subsquares.  The  backward  sweep  is  accomplished  in  a  similar  fashion  except 
that  we  proceed  from  right  to  left.  The  half  step  in  the  y  direction  is  performed  similarly,  with  the 
forward  sweep  proceeding  from  bottom  to  top  and  the  backward  sweep  from  top  to  bottom.  We 
observe  that  no  processor  is  idle  at  any  time.  It  is  clear  that  the  time  to  perform  the  arithmetic 
operations  is  again  of  the  form: 

n 8 n  8n2 

T R'Arith  &  - =  • 

ft  s  ft  $ 

When  we  cross  an  interface  in  the  forward  sweep,  we  must  transfer  the  pivot  row  plus  the 
corresponding  element  of  the  right  hand  side  of  each  of  the  n  tridiagonal  systems  in  each  processor 
numbered  j  to  the  processor  numbered  (j  + 1)  Mod  k.  These  I/O  operations  can  be  done  in  parallel 
for  each  of  the  processors  on  the  same  column.  A  total  of  k  -  1  interfaces  must  be  crossed.  For  the 
backward  sweep,  we  transfer  only  one  element  of  the  current  solution  per  system.  Therefore: 

TR'Comm  =  {k  -  1)  (r  +  — ^  +  (k  -  1)  (r  +  — )  2 (k  -  l)r  +  — . 
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Thus,  the  total  execution  time  comes  to 


lift)  =  2(*-l)r  +  £+2£. 

Clearly,  the  time  for  solving  in  the  other  direction  is  identical. 

We  observe  that  with  increasing  k  the  arithmetic  time  decreases  while  the  communication  time 
increases  linearly.  The  function  Tji[k)  has  a  minimum  which  is  achieved  for 


k0pi 


2  n 


and  the  corresponding  optimal  time  is  linear  in  n: 


Thus  by  an  appropriate  choice  of  algorithm  on  the  ring,  with  O(n)  processors  we  can  achieve  an 
O(n)  speed-up  contradicting  the  conventional  wisdom. 
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